Project Motivation: Investigate the impact of pandemic restrictions on hockey player development.
# load the data
library(readr)
sams_ohl_data_request <- read_csv("sams_ohl_data_request.csv")
# filter years
recent <- sams_ohl_data_request %>%
filter(season %in% c("2019-2020", "2020-2021", "2021-2022"))
# add points per game columns
recent <- recent %>%
mutate(ppg = g/gp)
# add drafted y/n column
recent <- recent %>%
mutate(got_drafted = case_when(!is.na(draft_year) ~ 'Yes',
TRUE ~ 'No'))
# yes is only plays drafted before COVID
recent2 <- recent %>%
mutate(got_drafted = case_when(!is.na(draft_year) & draft_year < 2020
~ 'Yes',
TRUE ~ 'No'))
# make dataframe of drafted players
drafted <- recent %>%
filter(!is.na(draft_year)) %>%
filter(draft_year < 2020)
rank2 <- recent %>%
filter(league == "OHL") %>%
group_by(team_name, season) %>%
arrange(team_name, season, desc(pm)) %>%
mutate(rank = 1:n()) %>%
ungroup()
ohl <- read_csv("sams_ohl_data_request.csv")
ohl_covid <- ohl %>%
filter(season %in% c("2019-2020","2020-2021", "2021-2022"))
## Add treatment var:
# Filtering only for players who played more than 10 games (should we combine number of games played across leagues?)
ohl_20_21 <- ohl %>%
filter(season == "2020-2021", gp > 10)
# Split players up into treatment vs non-treatment
ohl_treatment <- ohl %>%
filter(season %in% c("2019-2020", "2021-2022"), league == "OHL") %>%
group_by(player_id, season) %>%
mutate(ppg = sum(pts)/sum(gp),
treatment = ifelse(player_id %in% ohl_20_21$player_id, "Played", "Didn't play")
) %>%
ungroup()
plyr_quality <- ohl %>%
filter(season == "2019-2020", league == "OHL") %>%
group_by(player_id) %>%
mutate(plyr_quality = sum(pts)/sum(gp)) %>%
filter(duplicated(player_id) == FALSE) %>%
ungroup() %>%
select(player_id, plyr_quality)
ohl_pm <- ohl %>%
filter(season %in% c("2019-2020", "2021-2022"), league == "OHL") %>%
group_by(team_name, season) %>%
arrange(team_name, season, desc(pm)) %>%
mutate(pm_rank = 1:n(),
pm_mean = mean(pm),
pm_relative = pm - mean(pm)
) %>%
ungroup()
# We need to check that each player played in the OHL in 2019-2020 and 2021-2022
ohl_szn <- ohl_pm %>%
group_by(player_id) %>%
mutate(played_2019 = season == "2019-2020",
played_2021 = season == "2021-2022",
played_both_szn = sum(played_2019) & sum(played_2021)) %>%
ungroup() %>%
filter(played_both_szn == TRUE)
# add treatment variable through join
ohl_trt <- ohl_szn %>%
left_join(select(ohl_treatment, player_id, season, team_name, treatment), by = c("player_id", "team_name", "season")) %>%
mutate(ppg = pts/gp)
# add age variable
ohl_age <- ohl_trt %>%
group_by(player_id, season) %>%
mutate(year = strsplit(season, "-")[[1]][1]) %>%
mutate(age = trunc((dob %--% as.Date(paste0(year, "-9-15"))) / years(1)),
age_continuous = (dob %--% as.Date(paste0(year, "-09-15"))) / years(1)
) %>%
ungroup()
# add player quality
ohl_qlty <- ohl_age %>%
left_join(plyr_quality, by = "player_id") %>%
group_by(player_id, season) %>%
mutate(ppg_total = sum(pts)/sum(gp),
gp_total = sum(gp),
pts_total = sum(pts)) %>%
arrange(player_id, season, -plyr_quality) %>%
filter(duplicated(player_id) == FALSE) %>%
ungroup()
# were they drafted
ohl_filtered <- ohl_qlty %>%
mutate(drafted = ifelse(!is.na(draft_year) == TRUE, "Yes", "No")) %>%
select(season, player_id, treatment, first_name, last_name, position, plyr_quality, age, gp_total, pts_total, ppg_total, drafted, draft_year, round, overall_pick_num, age_continuous, pm_relative, pm_rank, pm)
# just post-COVID stats
ohl_treatment_21_22 <- ohl_filtered %>%
filter(season == "2021-2022")
Our dataset is comprised of players who played in the Ontario Hockey League (OHL) during the 2019-2020, 2020-2021, and 2021-2022 seasons. The dataset also includes information regarding other leagues they have played in at during their career. There is information regarding season, team, league, points, games played, position, and drafted status. Each row is a player on a certain team in a specific season.
Filtering (unless otherwise specified):
2019-2020 (“pre-COVID”) and 2021-2022 (“post-COVID”) seasons
league == OHL
Only players who played in the OHL during both pre- and post-COVID seasons
Variables added:
points per game (PPG) per season (combined if a player played for multiple teams in a season)
games played (GP) per season (combined if a player played for multiple teams in a season)
treatment (i.e. whether a player played more than 10 games during the COVID season)
age (approximately the oldest a player was in a given season)
player quality approximated by plus-minus (PM) relative to average team PM in pre-COVID season
whether a player was drafted (not totally up to date)
all players includes any player with data from the 2019-2020, 2020-2021 and/or 2021-2022 seasons
Snippet of Raw Data:
recent %>%
dplyr::select(name, team_name, season, league, position,
gp, g, a, pts, pm) %>%
head(5) %>%
knitr::kable()%>%
kable_styling("striped")
| name | team_name | season | league | position | gp | g | a | pts | pm |
|---|---|---|---|---|---|---|---|---|---|
| Michael Renwick | Hamilton Bulldogs | 2019-2020 | OHL | D | 59 | 3 | 14 | 17 | 2 |
| Michael Renwick | Windsor Spitfires | 2021-2022 | OHL | D | 67 | 11 | 18 | 29 | 28 |
| Grayson Tiller | Barrie Colts | 2021-2022 | OHL | D | 39 | 1 | 5 | 6 | -3 |
| Kasper Larsen | Mississauga Steelheads | 2021-2022 | OHL | D | 57 | 8 | 35 | 43 | 12 |
| Landen Hookey | Soo Greyhounds | 2021-2022 | OHL | F | 43 | 1 | 4 | 5 | -2 |
Player Example:
recent %>%
dplyr::select(name, team_name, season, league, position,
gp, g, a, pts, pm) %>%
filter(name == "Brett Harrison") %>%
arrange(season) %>%
knitr::kable() %>%
kable_styling("striped")
| name | team_name | season | league | position | gp | g | a | pts | pm |
|---|---|---|---|---|---|---|---|---|---|
| Brett Harrison | Oshawa Generals | 2019-2020 | OHL | F | 58 | 21 | 16 | 37 | -21 |
| Brett Harrison | Canada White U17 | 2019-2020 | WHC-17 | F | 6 | 2 | 0 | 2 | NA |
| Brett Harrison | Canada U18 | 2020-2021 | WJC-18 | F | 7 | 2 | 0 | 2 | 4 |
| Brett Harrison | KOOVEE | 2020-2021 | Mestis | F | 1 | 0 | 0 | 0 | 0 |
| Brett Harrison | KOOVEE U20 | 2020-2021 | U20 SM-sarja | F | 7 | 4 | 5 | 9 | -1 |
| Brett Harrison | Oshawa Generals | 2021-2022 | OHL | F | 65 | 27 | 34 | 61 | -15 |
# treatment vs ppg
ggplot(ohl_filtered, aes(x = ppg_total, color = treatment)) +
geom_density() +
labs(title = "Players who played during COVID season generally had higher PPG in\n2019-20 and 2021-22 seasons",
x = "PPG",
color = "COVID season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# age vs ppg
ggplot(ohl_filtered, aes(x = age, y = ppg_total)) +
geom_point(alpha = .5) +
labs(title = "Positive linear relationship between PPG and Age in \n2019-20 and 2021-22 seasons",
x = "Age",
y = "PPG") +
geom_smooth(method = "lm") +
theme_bw()
# gp vs ppg
ggplot(ohl_filtered, aes(x = gp_total, y = ppg_total)) +
geom_point(alpha = .5) +
labs(title = "Positive non-linear relationship between PPG and GP in \n2019-20 and 2021-22 seasons",
x = "GP",
y = "PPG") +
theme_bw()
# drafted vs ppg
ggplot(ohl_filtered) +
geom_density(aes(x = ppg_total, color = drafted)) +
labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 seasons",
x = "PPG",
color = "Drafted?") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# pm vs ppg
ggplot(ohl_filtered, aes(x = pm_relative, y = ppg_total)) +
geom_point(alpha = .5) +
labs(title = "Positive linear relationship between PPG and relative PM in \n2019-20 and 2021-22 seasons",
x = "Relative PM",
y = "PPG") +
geom_smooth(method = "lm") +
theme_bw()
# position vs ppg
ggplot(ohl_filtered) +
geom_density(aes(x = ppg_total, color = position)) +
labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 seasons",
x = "PPG",
color = "Position") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# season vs ppg
ggplot(ohl_filtered) +
geom_density(aes(x = ppg_total, color = season)) +
labs(title = "PPG in 2021-22 season generally higher than in 2019-2020 season",
x = "PPG",
color = "Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# drafted status vs treatment
mosaicplot(table("Drafted" = ohl_filtered$drafted,
"COVID Season" = ohl_filtered$treatment),
main = "Drafted players more likely to have played during COVID",
shade = TRUE)
# pm vs treatment
ggplot(ohl_filtered, aes(x = pm_relative, color = treatment)) +
geom_density() +
labs(title = "Insignificant difference in relative PM for players who played during COVID season",
x = "relative PM",
color = "COVID Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# age vs treatment
ggplot(ohl_filtered, aes(x = age, color = treatment)) +
geom_density() +
labs(title = "Insignificant difference in age for players who played during COVID season",
x = "Age",
color = "COVID Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# gp vs treatment
ggplot(ohl_filtered, aes(x = gp_total, color = treatment)) +
geom_density() +
labs(title = "Players who played during COVID season generally played more games",
x = "GP",
color = "COVID season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
ppg_ohl <- recent %>%
filter(league == "OHL") %>%
filter(ppg > 0) %>%
ggplot(aes(x = gp, y = ppg)) +
geom_point(aes(color = got_drafted), alpha = 0.5, size = 1) +
labs(title = "OHL", x = "Games Played", y = "Points Per Game",
color = "Drafted Before \n 2020-2021 Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw() +
theme(legend.position = "none")
ppg_all <- recent %>%
filter(ppg > 0) %>%
ggplot(aes(x = gp, y = ppg)) +
geom_point(aes(color = got_drafted), alpha = 0.5, size =1) +
labs(title = "All Leagues", x = "Games Played", y = "Points Per Game",
color = "Drafted Before \n 2020-2021 Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
ppg_ohl + ppg_all
People that played less games had a higher points per game than those with greater game counts.
Draft Status “yes” refers to players drafted before the 2020-2021 season.
mosaicplot(table(recent$got_drafted, recent$season), shade = TRUE,
main = element_blank())
During the 2020-2021 season the majority of players had already previously been drafted. The following season was dominated by undrafted players in this dataset.
# draft pick number on x-axis, gp on y-axis
drafted %>%
ggplot(aes(x = overall_pick_num, y = gp)) +
geom_point(alpha = 0.5) +
geom_smooth() +
labs(x = "Draft Pick Number", y = "Games Played", caption = "Any player that was drafted, regardless of draft year.") +
theme_bw()
Players picked later in the draft recorded more games.
recent %>%
ggplot(aes(x = gp)) +
geom_density(aes(color = factor(got_drafted)), alpha = 0.3) +
labs(x = "Games Played", y = "Density", color = "Draft Status",
caption = "All players.") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw() +
facet_wrap(~season)
According to this graph, during the 2020-2021 season drafted players only played slightly more games than undrafted players. In the 2021-2022 undrafted players particpated in more games.
In the following graphs, the blue line acts as a reference line at y = 0. The gold line is the team’s average plus-minus.
rank2 %>%
filter(team_name == "Ottawa 67's") %>%
ggplot(aes(x = player_id, y = pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "Ottawa 67's", subtitle = "Highest Player Plus-Minus",
y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The Ottawa 67’s contain the individual with the highest plus-minus in the Ontario Hockey League. Overall the team had much lower numbers in the 2021-2022 season. A singular player was not below the mean.
rank2 %>%
filter(team_name == "North Bay Battalion") %>%
ggplot(aes(x = player_id, y = pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "North Bay Battalion", subtitle = "Lowest Player Plus-Minus",
y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The North Bay Battalion possess the player with the lowest plus-minus in the Ontario Hockey League. This team greatly improved in the later season. The majority switched to above the mean, opposite of the previous season.
rank2 %>%
filter(team_name == "Hamilton Bulldogs") %>%
ggplot(aes(x = player_id, y = pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "Hamilton Bulldogs", subtitle = "Median Player Plus-Minus",
y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The mean plus-minus across the OHL is -1. The Hamilton Bulldogs is one of the teams with the highest number of players with that score. This graph looks highly similar to the North Bay Battalion.
teampm <- recent %>%
filter(league == "OHL") %>%
group_by(team_name, season) %>%
arrange(team_name, season, desc(pm)) %>%
summarize(team_pm = mean(pm))
teampm %>%
ggplot(aes(x = team_name, y = team_pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(team_pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "Team Plus-Minus", y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The distributions of team plus-minus scores in the 2019-2020 and 2021-2022 are alike.
Before performing causal analysis, we wanted to determine whether playing during the COVID season had a significant effect on PPG when controlling for variables (and interactions between variables) suspected to be associated with the response through EDA. Because PPG is right skewed and bounded between 0 and some positive number, the assumptions of ordinary least squares regression are not met. We performed Gamma regression with the log link function to more accurately model the relationship between our response and explanatory variables.
Initial model: ppg_alt ~ position * pm_relative + treatment * drafted + gp_total + pm_relative * gp_total + pm_relative * drafted + age_continuous * pm_relative + season
ohl_filtered <- ohl_filtered %>%
mutate(ppg_alt = ppg_total + .001)
ohl_glm_interaction <- glm(ppg_alt ~ position*pm_relative + treatment*drafted + drafted*season + gp_total + pm_relative*gp_total + pm_relative*drafted + age_continuous*pm_relative + season,
data = ohl_filtered,
family = Gamma(log))
plot(ohl_glm_interaction, which = 1)
#summary(ohl_glm_interaction)
\(ln(\widehat{\text{PPG} + .001}) = -5.983 + .602*\text{Forward} + .00412*\text{Relative PM} + .0622*\text{Played during COVID} + .365*\text{Drafted} + .0174* \text{GP} + .204*\text{Age} \\ + .0266*\text{Season} + .000635*\text{Forward}*\text{Relative PM} + .338*\text{Played during COVID} * \text{Drafted} -.0000285*\text{relative PM}*\text{GP}\\ + .00570*\text{relative PM}*\text{Drafted} + .000470*\text{Relative PM}*\text{Age}\)
Variables found to be significant at the \(\alpha = .001\) level in our model:
Position
Relative PM
Drafted status
Games played
Age
Variables found to be significant at the \(\alpha = .05\) level in our model:
Model limitations:
Player Performance:
Defined as \(PPG = \frac{(points + assists)}{games \hspace{.1 cm} played}\)
May not adequately capture all aspects of a player’s performance
Though we control for position in our model, PPG will generally be higher for forwards than defensemen
Defensive performance is not well-approximated by PPG, and this measure would bias offensive defensemen as the best-performing defensemen.
Good defensive forwards may be undervalued using this metric
A more nuanced measure that incorporates important defensive statistics like turnovers created, passes completed, shots blocked, etc. may better approximate player performance.
Player quality
Defined by \(player quality = PM_{relative} = PM_{player} - \mu_{team\hspace{.05cm}PM}\)
Plus minus does not account for all of the statistics that might measure the quality of a player (shots blocked, turnovers created, passes completed, etc)
Does not consider factors like sheltering
Treatment Variable:
Defined as players who participated in more than 10 games during the COVID season in a single league or tournament.
We did not control for quality of league.
Though typically only the best players are selected to play in championships/tournaments, many of them were not included as treated in our analysis because of their low game count.
Conversely, players who played only a lot of games games in mediocre leagues may have been counted as treated.
Skaters who played more than 10 games cumulatively across two or more different leagues would also not be considered treated, though they played a significant number of games.
Future analysis could provide a more distinct view of the treatment variable in which both league quality and number of games played are controlled for.
Next steps for future work:
Thank you to our advisor Dr. Sam Ventura (and assistant Dominic Ventura) and our professor, Dr. Ron Yurko. We would also like to thank our TAs, Nick Kissel, Meg Ellingwood, Wanshan Li, Kenta Takatsu, and YJ Choe. Lastly, thank you to Professor Ben Baumer, Smith College and Dr. Ryne Vankrevelen, Elon University. We could not have completed this project without you all and we sincerely appreciate your help!
…